Optimal play in human-judged Debate usually won't answer your question

Joe Collman

4Steve Byrnes8h

Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning. In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem. I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture. So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)

3Steve Byrnes8h

Thanks! I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here: (A bit more related discussion here.) That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.) I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how … One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, -1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate. Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”) Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-criti

3Towards_Keeperhood6h

Thanks! I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed my first attempt of how I wanted to explain what I feel like is wrong sucked and thus dropped it. This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here). However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it's just about deception), and I think it's not: I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying "the learned value function just predicts reward", and that may work while you're staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you're going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to. I think humans have particular not-easy-to-pin-down machinery inside them, that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the "indirect alignment" one that say "fulfill human's CEV", which has lower complexity than the ones humans gen

Steve Byrnes4h20

I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.

OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its se... (read more)

AI for AI safety

Joe Carlsmith

20d

This is a linkpost for https://joecarlsmith.substack.com/p/ai-for-ai-safety

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

Safety progress: our ability to develop new levels of AI capability safely,
Risk evaluation: our ability to track and forecast the level

...

(Continue Reading – 4937 more words)

Chris_Leong17h*10

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors^[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very dif... (read more)

MAD Chairs: A new tool to evaluate AI

Chris Santos-Lang

This is a linkpost for https://arxiv.org/abs/2503.20986v2

William Whewell might claim that becoming less wrong has often been accompanied by the introduction of new concepts. In that spirit, I have written a paper about a previously undiscussed fundamental game (think Coordination game and Prisoner's Dilemma) which Chris Homan and I are calling "MAD Chairs". I consider this concept less as a new invention than as a previously overlooked discovery. It happens to be relevant to AI safety--specifically, to the risk of AI displacing humanity at the top of the caste system--and I hope it can help us be less wrong about AI safety.

I will present the paper at an AAMAS workshop in Detroit on May 20, then it will be revised to incorporate the feedback received there. If anyone in the LessWrong community has insight to contribute, but will not attend the workshop, please share with me before May 20, and I will do my best to bring it to the workshop. A preprint can be found at the link (https://arxiv.org/abs/2503.20986v2).

Thomas Kwa's Shortform

Thomas Kwa

37Thomas Kwa2d

Some versions of the METR time horizon paper from alternate universes: Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh) Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-∞ yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends. When Will Worrying About AI Be Automated? Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves. Estimating Time Since The Singularity Early work o

Buck Shlegeris2d130

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

Rebuttals for ~all criticisms of AIXI

Cole Wyeth

3mo

Written as part of the AIXI agent foundations sequence, underlying research supported by the LTFF.

Epistemic status: In order to construct a centralized defense of AIXI I have given some criticisms less consideration here than they merit. Many arguments will be (or already are) expanded on in greater depth throughout the sequence. In hindsight, I think it may have been better to explore each objection in its own post and then write this post as a summary/centralized reference, rather than writing it in the middle of that process. Some of my takes have already become more nuanced. This should be treated as a living document.

With the possible exception of the learning-theoretic agenda, most major approaches to agent foundations research construct their own paradigm and mathematical tools which are...

(Continue Reading – 4022 more words)

Wei Dai2d20

My objection to this argument is that it not only assumes that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would infer otherwise unless that helps to explain observations,

Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again... (read more)

orthonormal's Shortform

orthonormal

orthonormal2d*10

[EDIT: Never mind, this is just Kleene's second recursion theorem!]

Quick question about Kleene's recursion theorem:

Let's say F is a computable function from ℕ^N to ℕ. Is there a single computable function X from ℕ^N to ℕ such that

X = F(X, y_2,..., y_N) for all y_2,...,y_N in ℕ

(taking the X within F as the binary code of X in a fixed encoding) or do there need to be additional conditions?

AI ALIGNMENT FORUM
AF

AI Alignment Posts

Popular Comments

Recent Discussion

1. Introduction and summary